De-Duplication Scheduling Strategy in Real-Time Data Warehouse

Authors

  • Hui Liu
  • Jie Song
  • Yu-Bin Bao

Abstract

Data quality in a data warehouse is crucial to decision-makers, and data duplication is considered one of the critical factors that affect it. Data de-duplication is therefore an essential process in data warehousing. For a real-time data warehouse in particular, it is necessary to ensure not only data quality in real time but also the performance of front-end queries and analysis, so the scheduling strategy for de-duplication in a real-time data warehouse deserves careful study. In this paper, we first investigate three kinds of data de-duplication scheduling strategies, named De-duplication Prior scheduling Strategy (DPS), Real-time scheduling Strategy (RS), and ETL Prior scheduling Strategy (EPS); we then propose a new Time-Triggered scheduling Strategy (TTS), which belongs to EPS; finally, we evaluate the performance of the proposed scheduling strategy through experiments. This work contributes to efficient data cleaning and to the application of real-time data warehouses.
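
The abstract names TTS as a time-triggered member of the ETL-prior family but does not spell out its mechanics. The following is only a minimal sketch of one possible reading, assuming that loading always proceeds without a duplicate check and that de-duplication runs once a fixed interval has elapsed. The class `TimeTriggeredWarehouse`, the `interval` parameter, the integer clock, and the exact-match duplicate rule are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TimeTriggeredWarehouse:
    """Toy warehouse: loads rows at once, de-duplicates only on a timer."""
    interval: int                      # hypothetical trigger period, in ticks
    rows: list = field(default_factory=list)
    last_run: int = 0

    def load(self, now: int, batch: list) -> None:
        # ETL has priority: new rows are appended without any duplicate check.
        self.rows.extend(batch)
        # De-duplication fires only once the interval has elapsed.
        if now - self.last_run >= self.interval:
            self.deduplicate(now)

    def deduplicate(self, now: int) -> None:
        # Exact-match de-duplication that keeps the first occurrence
        # (an assumption; real de-duplication is usually fuzzier).
        self.rows = list(dict.fromkeys(self.rows))
        self.last_run = now


wh = TimeTriggeredWarehouse(interval=10)
wh.load(now=3, batch=["a", "b", "a"])   # loaded as-is; timer has not fired
rows_before = list(wh.rows)
wh.load(now=12, batch=["b", "c"])       # timer fires; duplicates removed
rows_after = list(wh.rows)
```

In this sketch, queries issued between triggers may still see duplicates, which is the kind of trade-off between data quality and load performance that the abstract describes.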

Similar resources

Improvement of the Analytical Queries Response Time in Real-Time Data Warehouse using Materialized Views Concatenation

A real-time data warehouse is a collection of recent and hierarchical data that is used for managers’ decision-making by creating online analytical queries. The volume of data collected from data sources and entered into the real-time data warehouse is constantly increasing. Moreover, as the volume of input data to the real-time data warehouse increases, the interference between online loading ...

Green Energy-aware task scheduling using the DVFS technique in Cloud Computing

Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint, and CO2 emissions in high-performance computing (HPC) systems such as clusters, grids, and clouds. Reducing energy consumption for high-end computing can bring various benefits such as red...

Predicting Maximum Data Staleness in Real-Time Warehouses

This paper presents an analysis technique for estimating maximum data staleness in a data warehouse that collects “near-real-time” data streams. Data is pushed to the warehouse from a variety of external sources with a wide range of inter-arrival times (e.g., once a minute to once a day). In prior work, ad hoc heuristic algorithms have been proposed for scheduling warehouse updates. In this pap...

Real-time Scheduling of a Flexible Manufacturing System using a Two-phase Machine Learning Algorithm

The static and analytic scheduling approach is very difficult to follow and is not always applicable in real time. Most scheduling algorithms are designed for an offline environment. However, we are challenged with three characteristics in real cases: First, problem data of jobs are not known in advance. Second, most of the shop’s parameters tend to be stochastic. Third, th...

Integrated Order Batching and Distribution Scheduling in a Single-block Order Picking Warehouse Considering S-Shape Routing Policy

In this paper, a mixed-integer linear programming model is proposed to integrate batch picking and distribution scheduling problems in order to optimize them simultaneously in an order picking warehouse. A two-phase heuristic algorithm is presented to solve it in reasonable time. The first phase uses a genetic algorithm to evaluate and select permutations of the given set of customers. The seco...

Publication date: 2015